Mohammad Najm-Araghi, University of
Konstanz, najm.araghi@googlemail.com
Sergey Pulnikov,
University of Konstanz, sergey.pulnikov@googlemail.com
Dr. Peter Bak, University of Konstanz, bak@dbvis.inf.uni-konstanz.de
We developed our own tool for preprocessing and for some of the visualizations.
To visualize the data we also use the Many Eyes online
tool. Many Eyes was developed by IBM's research group in 2004.
To draw some of the graphs we use a network analysis
application called visone. The origins of the visone project lie in a single
link between the Algorithms & Data Structures
Group in the Department of Computer & Information Science,
and the Domestic Politics & Public Administration
Group in the Department of Politics & Management,
both at the Universität Konstanz. To achieve the same
results as we did, you do not need any programming
knowledge. With our tool, anyone can
create the same visualizations in less
than an hour.
Video:
http://ava.dbvis.de/AVA_MC1.mp4
ANSWERS:
MC1.1: Summarize the activities that happened in each country with
respect to illegal arms deals based on a synthesis of the information from the
different report types and sources. State the situation in each country at the
end of the period (i.e. the end of the information you have been given) with
respect to illegal arms deals being pursued. Present a hypothesis about the
next activities you expect to take place, with respect to the people, groups,
and countries.
1 Introduction
We have
developed a tool that mines data from the reports and visualizes the required
information. From each report, we extract locations, persons, organizations and
dates. In the next step, we eliminate duplicates and save the results in our
internal data structure. Our tool provides several ways to create an
interactive visualization of the extracted data. To show the summarized
activities that happened in each country with respect to illegal arms dealing,
we created an interactive World Map Visualization. To make recent
activities more important, we also implemented a weight function.
The
implementation took about a week of work. Over the whole period, some
manual tasks were interleaved with the implementation of our tool. Finding previous work
that fits our approach, editing the findings manually and searching for
appropriate visualizations took a bit more than one week of work. In total, the
whole project took about 2 months and led to an interesting solution, which is
presented in the following chapters.
2 Knowledge Discovery in Text and Visual Analysis
The reports about
illegal arms dealing that we need to analyze are text files. Most reports
contain information about locations, persons and sometimes organizations that
are involved in illegal arms dealing. Sources for some of these reports are
telephone calls, mails and blogs.
Our first goal
was to extract relevant information from the reports for further analysis. To
do this, we developed our own tool based on the Stanford Named Entity
Recognizer (NER) [3]. NER labels sequences of words in a text that are names
of things, such as persons, locations and organizations.
We extended this library to label dates as well. All duplicates were
eliminated. From each report we get, for example: Locations: Turkey; Persons:
Hakan; Organizations: United Arab Emirates; Date: 16.12.2008. The manual process, which includes the web
search for existing tools, took about 3 hours.
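The extraction and duplicate elimination described above can be illustrated with a short Python sketch. It is not the code of our tool: the function names `collect_entities` and `find_dates` are hypothetical, the input is assumed to be a list of (token, label) pairs such as a named entity tagger might produce, and the date pattern only covers the dd.mm.yyyy form seen in the example.

```python
import re

def collect_entities(tagged_tokens):
    """Merge consecutive tokens that share a label and drop duplicates.

    tagged_tokens: list of (word, label) pairs; label "O" marks non-entities.
    Returns {label: [unique entity strings in order of first appearance]}.
    """
    entities = {}
    current_label, current_words = None, []

    def flush():
        # Store the entity collected so far, unless it is a non-entity run.
        if current_label and current_label != "O":
            name = " ".join(current_words)
            bucket = entities.setdefault(current_label, [])
            if name not in bucket:
                bucket.append(name)

    for word, label in tagged_tokens:
        if label != current_label:
            flush()
            current_label, current_words = label, []
        current_words.append(word)
    flush()
    return entities

def find_dates(text):
    """Assumed date format: day.month.year with a four-digit year."""
    return re.findall(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b", text)

# Toy input, invented for illustration only.
tagged = [("Hakan", "PERSON"), ("met", "O"), ("in", "O"), ("Turkey", "LOCATION"),
          ("with", "O"), ("United", "ORGANIZATION"), ("Arab", "ORGANIZATION"),
          ("Emirates", "ORGANIZATION"), (".", "O"), ("Hakan", "PERSON")]
print(collect_entities(tagged))
print(find_dates("Call recorded on 16.12.2008."))
```

Keeping the first appearance order makes the per-report output stable, which simplifies manual checking.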
Fig. 1. Bar-plot visualization shows the countries, ordered by frequency of occurrences.
To summarize
the activities that happened in each country (Figure 1), we created a list of all
countries that occur in the reports, together with the number of reports in which
each appears (without counting duplicates within a report). However, our main interest was to
show not only the frequency, but also the temporal variation in occurrence. In
order to achieve this aim, we calculated a weight function that makes
occurrences that happened recently more important than ones that occurred in the
past. Results are shown in Figure 2. The manual analysis took 2 hours. The
extension of the tool was the larger task in this case and took about 4
hours.
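One possible form of such a weight function is sketched below in Python. The exponential decay and the 90-day half-life are assumptions made for this illustration; the text above only states that recent occurrences count more than older ones. The name `interestingness` is hypothetical.

```python
import math
from datetime import date

def interestingness(mentions, reference_day, half_life_days=90.0):
    """Sum recency-weighted mentions per country.

    mentions: list of (country, date) pairs, one per report mentioning it.
    A mention on reference_day counts 1.0; one that is half_life_days old
    counts 0.5 (assumed exponential decay).
    """
    decay = math.log(2) / half_life_days
    scores = {}
    for country, day in mentions:
        age = (reference_day - day).days
        scores[country] = scores.get(country, 0.0) + math.exp(-decay * age)
    return scores

# Toy input, invented for illustration only.
toy = [("X", date(2009, 4, 1)), ("X", date(2009, 1, 1)), ("Y", date(2009, 1, 1))]
print(interestingness(toy, reference_day=date(2009, 4, 1)))
```

With this choice, two countries with the same raw frequency are ranked by how recent their mentions are, which is the behavior the map in Figure 2 encodes as saturation.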
Our tool can
visualize this data in several ways.
· Interactive bar-chart implemented within the tool
· Export data for IBM Many Eyes visualization [4]
Fig. 2. World map shows the frequency of
occurrences weighted by their temporal parameter (recent events are more
important) in a combined “interestingness factor”. The saturation of countries
increases with the importance of the country.
An overview
of persons and organizations can also be visualized with bar charts in our
tool. In order to
enhance the quality of the extracted information, we extended
our tool to allow manual changes to the preprocessing results [1]. To improve the results,
we invested about 20 minutes in manual changes of countries. For example,
Moscow was changed to Russia, and so on. To visualize all persons and
the connections between them, we use Many Eyes and export the corresponding data
with our tool.
Fig. 3. Social
network graph with all actors extracted by our tool. Each sub-graph corresponds
to one social network.
As you can see
from this visualization (Figure 3), not all persons are connected with each
other. There are several isolated groups of persons. We next aimed at a combined view of persons and
countries. In order to avoid overloading such a combined graph, we
implemented a filtered visualization, where we can manually set an importance
boundary for country and for person filtering. This filtered visualization
provides us with several hypotheses about future activities. The hypotheses were that strong and also
multiple connections exist between the most important actors and the locations of their
activities.
Fig. 4. A subset
of a social network combined with locations of activities. The most probable arms deal will take place
between Nicolai and Saleh Ahmed, in Russia and Yemen.
The filtered
view makes some assumptions clearer. There are four countries left
that participate in the arms dealing network, therefore future events could be:
· Russia will deal with Ukraine
· Russia, Ukraine and Yemen will participate in the next deal
If we consider
the actors in the graph, the connections between them and the countries, there
are some other possible hypotheses based on the cardinality of the nodes.
· Maulana Haq Bukhari will start a deal in Pakistan
· Muhammad Kasem and Akram Basri will form a vicious triangle
· Nicolai and Saleh Ahmed will deal within three countries: Russia, Ukraine, Yemen
These
hypotheses are based only on the count associated with each node. Each
connection means that the two nodes appeared in the same report.
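The connection rule stated above (two nodes are linked when they appear in the same report) can be sketched in Python as follows. The function name `cooccurrence_edges` is hypothetical, and counting the number of shared reports as an edge weight is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(reports):
    """Count how many reports each pair of entities shares.

    reports: list of entity-name lists, one list per report.
    Returns a Counter mapping (name_a, name_b) with name_a < name_b
    to the number of reports containing both.
    """
    edges = Counter()
    for entities in reports:
        # set() removes duplicates within a report; sorting fixes pair order.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges

# Toy reports with placeholder names, invented for illustration only.
toy_reports = [["person_a", "country_x", "person_b"],
               ["person_b", "country_y"],
               ["person_a", "country_x"]]
for pair, weight in sorted(cooccurrence_edges(toy_reports).items()):
    print(pair, weight)
```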
To extract the
most probable theory, we created a time series visualization in the tool that
shows us the activity of country/person over the whole time period of reports. The implementation of
the time-series visualization took approximately 4 hours. The results are shown
in Figure 5.
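The data behind such a time series can be sketched in Python as below. Grouping by calendar month is an assumption of this sketch, the function name `monthly_activity` is hypothetical, and the dates are assumed to be in the dd.mm.yyyy form used in our extracted records.

```python
from collections import Counter, defaultdict
from datetime import datetime

def monthly_activity(mentions):
    """Count mentions per entity and calendar month.

    mentions: list of (entity, "dd.mm.yyyy") pairs.
    Returns {entity: Counter({(year, month): count})}.
    """
    series = defaultdict(Counter)
    for entity, date_text in mentions:
        day = datetime.strptime(date_text, "%d.%m.%Y")
        series[entity][(day.year, day.month)] += 1
    return series

# Toy mentions with placeholder names, invented for illustration only.
toy = [("country_x", "16.12.2008"), ("country_x", "01.12.2008"),
       ("country_y", "1.04.2009")]
for entity, counts in monthly_activity(toy).items():
    print(entity, sorted(counts.items()))
```

Two entities with similar series over the same months are candidates for related activity, which is how we read Figure 5.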
Our results show that most likely Nicolai will deal with Saleh Ahmed, and these deals will be connected to Russia and Yemen, as highlighted in Figure 4.
Fig. 5. Similar time-series of Yemen and
Russia based on the count (x-axis) and the appearance in the intelligence
reports over the whole time period (y-axis).
Mini Challenge 1.2 in the next section presents further analysis of the social
network. To support the hypotheses of this chapter, we use different network
layouts and weight the nodes with various measures.
References
1. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and T. Widener. The KDD process for extracting
useful knowledge from volumes of data. Communications of the ACM, 39:27–34,
1996.
2. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (1999)
3. J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information
extraction systems by Gibbs sampling, 2005. [Online; accessed 15-May-2010].
4. F. B. Viegas, M. Wattenberg, F. van Ham, J. Kriss, and M. McKeon. ManyEyes: a
site for visualization at internet scale. IEEE Trans Vis Comput Graph,
13(6):1121–1128, 2007.
MC1.2: Illustrate the associations among the players in the arms
dealing through a social network. If there are linkages among countries, please
highlight these as well in the social network. Our analysts are interested in
seeing different views of the social network that might help them in
counterintelligence activities (people, places, activities, communication
patterns that are key to the network).
1 Introduction
The second
task was to visualize the relationships among the players and countries
involved in dealing in arms through a social network. This network was to help
counterintelligence prevent further activities in different countries.
The primary objective was to detect the central actors in the network with
different analysis methods.
The major
problem in this task was the extraction of the entities needed for such a
visualization and analysis task. Our dataset consisted of 91 text records.
These records are drawn from newspaper accounts,
emails, message board entries, web sites, blog postings, telephone calls, bank
transactions and observations. Our task included the extraction of persons,
countries, organizations and even the dates of possible further activities from
this dataset.
2 Knowledge Discovery in Text and Network Analysis
This
introduction to the task and its possible problems shows that the
selection, pre-processing and transformation steps of KDD [2] are very important
for a valid analysis and the creation of useful social networks. It was obvious
that we needed Named Entity Recognition (NER) for the information extraction
part. To avoid starting from scratch, our first step was a web search
session in which we examined existing applications in this area of
research. This session took about 2-3 hours. Having tested some tools, we chose
a NER system from the Stanford Natural Language Processing Group [3]. After a
configuration step for determining the right classifier
(classifiers/nereng-ie.crf-3-all2008.ser), we were satisfied with a first
console output containing locations, names and organizations.
Based on this
output, we were able to start our own implementation of a tool that included
almost all KDD steps, except for the evaluation part.
Fig. 1. Extracted actors after the preprocessing step.
For a useful
data input, we separated the records at blank lines and used each record as
an independent corpus. To this end, we wrote a small script that separates the
records and saves them in different files. The prepared files are
the input for our application. The interface of our application
is divided into three main windows. The first shows the information extraction
part. The second supports the manual analysis (Figure 1) and enables us to get
an overview with a bar-chart visualization. The third one is suited for the
network analysis task. The read reports button on the first tab invokes the
pre-processing step as described, and visualizes the output of the different
text records. The findings for one item could look like this:
Locations: Nairobi, Kenya, Dubai, Moscow
Persons: Nahid
Organizations: Tanya
Date: 1.04.2009
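The record-splitting step described above could look like the following Python sketch. It is an illustration rather than our original script; the function names and the `report_001.txt` file naming scheme are hypothetical.

```python
import re
from pathlib import Path

def split_records(text):
    """Split the raw dataset into records at blank lines."""
    blocks = re.split(r"\n\s*\n", text)
    return [block.strip() for block in blocks if block.strip()]

def write_records(text, out_dir):
    """Save each record to its own file and return the created paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, record in enumerate(split_records(text), start=1):
        path = out / f"report_{index:03d}.txt"  # hypothetical naming scheme
        path.write_text(record, encoding="utf-8")
        paths.append(path)
    return paths

# Toy input, invented for illustration only.
raw = "first record, line 1\nfirst record, line 2\n\nsecond record\n\n\nthird record\n"
print(split_records(raw))
```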
The date
category is an extension of the Stanford system. We have also deleted duplicate
names, organizations and location entries within one item. In addition, we did
some manual work to map cities to their corresponding countries. The
manual editing of the data set is supported by a feature of our application.
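Mapping cities to countries can be sketched in Python with a small lookup table. The table below is a hypothetical stand-in for the manual edits, filled only with the cities from the example item above; the function name `normalize_locations` is also hypothetical.

```python
# Hypothetical lookup table standing in for the manual edits.
CITY_TO_COUNTRY = {
    "Moscow": "Russia",
    "Nairobi": "Kenya",
    "Dubai": "United Arab Emirates",
}

def normalize_locations(locations, mapping=CITY_TO_COUNTRY):
    """Replace known cities by their country and drop resulting duplicates."""
    seen = set()
    result = []
    for location in locations:
        country = mapping.get(location, location)  # unknown names pass through
        if country not in seen:
            seen.add(country)
            result.append(country)
    return result

print(normalize_locations(["Nairobi", "Kenya", "Dubai", "Moscow"]))
# prints ['Kenya', 'United Arab Emirates', 'Russia']
```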
Another
important decision that was based on a manual approach was the weighting of the
U.S.A. as a node. The cardinality of the U.S.A. in the whole dataset is high. But
after a manual examination of the textual records, it turned out that this high
cardinality of the U.S.A. is related to the authors of the records. We were
interested in the actors of the arms dealing network, therefore it was
reasonable to delete the U.S.A. as an actor in the network.
In this part,
we combined manual and automatic preprocessing. My partner spent 2 hours on
the manual preprocessing. Implementing the automatic support took us additional
hours.
Fig. 2. Using our tool for analysis of persons, countries and organizations.
The resulting
output is a solid foundation for further analysis. At this point, we had already
summarized each event with its important actors, the corresponding date and the place where
the event occurred. The next step was an import into a relational
database system. For this, we imported the four required attributes into the
database. Starting from textual information, we thus gained the ability to work
with a database management system and all its advantages.
This
result turns the determination of the count of each actor into a simple
aggregation of all items in one attribute. At the bottom of the first window,
this count is shown for all used attributes.
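The aggregation described above can be sketched with Python's built-in sqlite3 module. The schema (one `mention` table with `report_id`, `kind` and `name` columns) is an assumption of this sketch and not the layout of our database; the function name `count_entities` is hypothetical.

```python
import sqlite3

def count_entities(rows):
    """Count in how many reports each entity occurs.

    rows: list of (report_id, kind, name) tuples.
    Returns (kind, name, count) tuples, most frequent first.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mention (report_id INTEGER, kind TEXT, name TEXT)")
    con.executemany("INSERT INTO mention VALUES (?, ?, ?)", rows)
    result = con.execute(
        "SELECT kind, name, COUNT(DISTINCT report_id) AS n "
        "FROM mention GROUP BY kind, name ORDER BY n DESC, name"
    ).fetchall()
    con.close()
    return result

# Toy rows with placeholder names, invented for illustration only.
toy_rows = [(1, "location", "country_x"), (2, "location", "country_x"),
            (2, "person", "person_a"), (2, "location", "country_x")]
print(count_entities(toy_rows))
```

Using COUNT(DISTINCT report_id) keeps a repeated mention inside one report from being counted twice, in line with the duplicate elimination done during preprocessing.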
Having
finished this step, we began to work on the analytical part. We started
with a bar-chart visualization (using an existing Java library) of the corresponding
attributes. The y-axis represents the cardinality and the x-axis the different
persons, organizations and countries. This visualization in our second
interface window is very useful for getting a brief overview and for
determining a threshold for the number of nodes in our social network
visualization. The third component of our application supports the
visualization of social networks. This part is based on IBM's Many Eyes [4].
Our tool
enables us to visualize a country network, a social network and a filtered view
of these diagrams (Figure 2). The filtering can be accomplished with the
bar-chart in the second window. This is of course very useful for a more
precise perspective of the main actors and countries.
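The filtered view can be sketched in Python as below, with separate importance boundaries for persons and countries. The function name `filter_network`, the data layout and the toy values are hypothetical; an edge is kept only if both of its end nodes pass their boundary.

```python
def filter_network(edges, counts, kinds, thresholds):
    """Keep only edges whose two end nodes reach their importance boundary.

    edges:      {(node_a, node_b): weight}
    counts:     {node: number of reports mentioning the node}
    kinds:      {node: "person" or "country"}
    thresholds: {"person": minimum count, "country": minimum count}
    """
    keep = {node for node, count in counts.items()
            if count >= thresholds[kinds[node]]}
    return {pair: weight for pair, weight in edges.items()
            if pair[0] in keep and pair[1] in keep}

# Toy network with placeholder names, invented for illustration only.
edges = {("person_a", "country_x"): 3, ("person_b", "country_x"): 1,
         ("person_a", "country_y"): 1}
counts = {"person_a": 5, "person_b": 1, "country_x": 6, "country_y": 2}
kinds = {"person_a": "person", "person_b": "person",
         "country_x": "country", "country_y": "country"}
print(filter_network(edges, counts, kinds, {"person": 2, "country": 3}))
# prints {('person_a', 'country_x'): 3}
```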
Our first
analysis based on the resulting network showed us that the main actors are
separated into 4 distinct groups. The first subgraph contains Thaniti Otieno,
Nahid Owiti and Vanjdhi Onyango.
Nicolai Kuryakin and Saleh Ahmed represent the
second subgraph. Maulana Haq Bukhari and Akram Basra build the third, Muhammad
Kasem and Abdullah Khouri the fourth subgraph. Taken separately, these distinct subgraphs provide
little information. Therefore, we combined the most important countries with the
most frequent actors and their relationships. The result is an interesting
graph, as depicted in Figure 3.
This representation leads us to the assumption that
Nicolai works with Ukraine and Saleh Ahmed with Yemen.
Fig. 3. Social network of all actors above a certain threshold.
Furthermore, both work with Russia and act as
central persons in the whole network. The main countries in the dealing network
are Yemen, Russia and Ukraine. The filtering of the other actors and countries
is based on the inspection of each actor with the tool on tab three of our
application. This inspection shows that the other actors are not as central as
the aforementioned two persons. To support this hypothesis, we manually
imported the graph into a network analysis application called visone [1]. We
analyzed the graph with the betweenness measure and a centralized layout based
on this value (Figure 4). The resulting network supports our
hypothesis. If we map the view to the cardinality instead, two other actors appear:
Muhammad Kasem and Akram Basri, who are already known to us from the results of
the first step of the analysis.
Fig. 4. Visualization of the betweenness value. This is
useful for determining central actors.
Therefore,
these actors are also highly conspicuous.
To determine a hierarchy among the named persons, we chose a status
layout based on betweenness. The node size reflects the corresponding value
(Figure 5). This is further evidence for the central role of Russia as a country
and of Saleh Ahmed and Nicolai Kuryakin as players in the arms dealing network.
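For readers without visone, the betweenness measure can be reproduced with a short Python sketch. It implements the standard shortest-path betweenness for an undirected, unweighted graph using Brandes' accumulation scheme; it is not visone's code, the function name `betweenness` is hypothetical, and the values are unnormalized, so they may be scaled differently from those shown in visone.

```python
from collections import deque

def betweenness(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph.

    adjacency: {node: set of neighbor nodes}; every neighbor must be a key.
    """
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search counting shortest paths from source.
        order = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}   # number of shortest paths
        dist = {v: -1 for v in adjacency}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to source.
        delta = {v: 0.0 for v in adjacency}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted from both of its end points.
    return {v: s / 2 for v, s in score.items()}

# Toy star graph with placeholder names, invented for illustration only.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
print(betweenness(star))
```

In the toy star graph, every shortest path between two outer nodes runs through the hub, so the hub gets the maximum value while the outer nodes get zero. This is the pattern we read as a central, transmitting role in the network.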
Fig. 5. Hierarchical ordering of the arms deal network.
Future
activities can also be analyzed with our application. It provides a way to
visualize the appearance of countries over time. We used an existing Java library
for this implementation (Figure 2). The result of this extension is that Russia
and Yemen had a constant number of occurrences in the arms dealing network and
thus ought to be kept under close observation.
References
1. U. Brandes and D. Wagner. Visone, 2009. [Online; accessed 28-May-2010].
2. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and T. Widener. The KDD process for extracting
useful knowledge from volumes of data. Communications of the ACM, 39:27–34,
1996.
3. J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information
extraction systems by Gibbs sampling, 2005. [Online; accessed 15-May-2010].
4. F. B. Viegas, M. Wattenberg, F. van Ham, J. Kriss, and M. McKeon. ManyEyes: a
site for visualization at internet scale. IEEE Trans Vis Comput Graph,
13(6):1121–1128, 2007.
Conclusions of both MC1 and MC2:
Mini Challenge 1 shows that most likely Nicolai will deal with Saleh Ahmed, and these deals will be connected to Russia and Yemen. After answering the question it is clear that Russia, Ukraine and Yemen are the main settings in the arms dealing network. The United Arab Emirates is a secondary setting that also carries criminal risk. Using the time series in our tool, we can anticipate possible activities in these countries.
The second Mini Challenge helps us to support these assumptions and even expand some of them. Saleh Ahmed and Nicolai are the main actors in the network and act as intermediaries (maximum betweenness value) for all criminal activities. Building a hierarchy of the most heavily weighted nodes further supports all the hypotheses. Based on this evidence, arresting these two persons would be an important preventive step for the security agencies. The other actors also pose a danger, but based on our findings we think that they would be lost without the main protagonists.